A Study of Ranking Schemes in Internet-Scale Code Search
Authors
Sushil Bajracharya∗, Trung Ngo∗, Erik Linstead†, Paul Rigor†, Yimeng Dou†, Pierre Baldi†, Cristina Lopes∗
∗Institute for Software Research †Institute for Genomics and Bioinformatics
{sbajrach,trungcn,elinstea,prigor,ydou,pfbaldi,lopes}@ics.uci.edu
Institute for Software Research, University of California, Irvine, Irvine, CA 92697-3423
ISR Technical Report # UCI-ISR-07-08, November 2007
Abstract
The wide availability of source code on the Internet is enabling the emergence of specialized search engines that retrieve source code in response to a query. Searching at this scale amplifies problems that already exist when search is performed at the single-project level. Specifically, the number of hits can be several orders of magnitude higher, and the variety of coding conventions much broader. Finding information is only the first step for a search engine; in the case of source code, a method as simple as ‘grep’ will yield results. The second, and more difficult, step is to present the results ordered by some measure of relevance with respect to the terms being searched. We present an assessment of four heuristics for ranking code search results. The assessment was performed using Sourcerer, a search engine for open source code that extracts fine-grained structural information from the code. Results are reported for 1,555 open source Java projects, corresponding to 254 thousand classes and 17 million lines of code (LOC). Of the schemes compared, the one that produced the best search results combined (a) the standard TF-IDF technique over the Fully Qualified Names (FQNs) of code entities, (b) a ‘boosting’ factor for terms found towards the right-hand end of FQNs, and (c) a composition with a graph-rank algorithm that identifies popular classes.
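The best-performing combination named in the abstract can be illustrated with a small sketch. This is not the Sourcerer implementation: the function names, the camel-case tokenizer, the smoothed IDF formula, the geometric `boost` factor, the PageRank-style popularity score, and the way the text and popularity scores are multiplied together are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def fqn_terms(fqn):
    """Split an FQN into lowercase terms, one list per dot-separated segment."""
    segments = []
    for segment in fqn.split("."):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", segment)
        segments.append([p.lower() for p in parts])
    return segments


def idf(term, corpus):
    """Smoothed inverse document frequency; corpus is a list of term sets."""
    df = sum(1 for terms in corpus if term in terms)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def boosted_tfidf(query_terms, fqn, corpus, boost=2.0):
    """TF-IDF over FQN terms, weighted more heavily towards the right-most segment.

    Assumption: the right-most segment has weight 1 and each step to the
    left divides the weight by `boost`.
    """
    segments = fqn_terms(fqn)
    last = len(segments) - 1
    score = 0.0
    for i, segment in enumerate(segments):
        weight = boost ** (i - last)
        counts = Counter(segment)
        for term in query_terms:
            score += weight * counts[term] * idf(term, corpus)
    return score


def page_rank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [referenced nodes]}.

    Every referenced node must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for dst in nodes:
                    new[dst] += damping * rank[src] / n
        rank = new
    return rank


def rank_results(query, fqns, graph):
    """Order matching FQNs by boosted TF-IDF composed with popularity."""
    query_terms = query.lower().split()
    corpus = [{t for seg in fqn_terms(f) for t in seg} for f in fqns]
    popularity = page_rank(graph)
    scored = []
    for fqn in fqns:
        text_score = boosted_tfidf(query_terms, fqn, corpus)
        if text_score > 0:
            # Assumed composition: scale the text score by (1 + popularity).
            scored.append((text_score * (1.0 + popularity.get(fqn, 0.0)), fqn))
    return [fqn for _, fqn in sorted(scored, reverse=True)]
```

Under these assumptions, a query such as `sort` would rank a hypothetical class `org.util.QuickSort` above `org.sort.util.Helper`. The term matches in the right-most segment of the first name but only in a package segment of the second, and any references to `QuickSort` from other classes raise its popularity score further.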
Publication date: 2007